Text File | 1996-03-02 | 8.0 KB | 177 lines | [TEXT/ALFA] |
- NOTE: The following is the manual page supplied with the regular expression
- routines used both in Alpha and Tcl. - pete.
-
-
-
- NAME
- regcomp, regexec, regsub, regerror - regular expression
- handler
-
- SYNOPSIS
- #include <regexp.h>
-
- regexp *regcomp(exp)
- char *exp;
-
- int regexec(prog, string)
- regexp *prog;
- char *string;
-
- regsub(prog, source, dest)
- regexp *prog;
- char *source;
- char *dest;
-
- regerror(msg)
- char *msg;
-
- DESCRIPTION
- These functions implement egrep(1)-style regular expressions
- and supporting facilities.
-
- Regcomp compiles a regular expression into a structure of
- type regexp, and returns a pointer to it. The space has
- been allocated using malloc(3) and may be released by free.
-
- Regexec matches a NUL-terminated string against the compiled
- regular expression in prog. It returns 1 for success and 0
- for failure, and adjusts the contents of prog's startp and
- endp (see below) accordingly.
-
- The members of a regexp structure include at least the
- following (not necessarily in order):
-
- char *startp[NSUBEXP];
- char *endp[NSUBEXP];
-
- where NSUBEXP is defined (as 10) in the header file. Once a
- successful regexec has been done using the regexp, each
- startp-endp pair describes one substring within the string,
- with the startp pointing to the first character of the
- substring and the endp pointing to the first character
- following the substring. The 0th substring is the substring
- of string that matched the whole regular expression. The
- others are those substrings that matched parenthesized
- expressions within the regular expression, with
- parenthesized expressions numbered in left-to-right order of
- their opening parentheses.
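-
- The startp/endp convention can be tried out without the library
- described here. The sketch below is an assumption-laden stand-in: it
- uses the POSIX <regex.h> interface, whose regcomp and regexec share
- names with the routines above but take different arguments, and whose
- rm_so/rm_eo offsets play the role of startp[n]/endp[n]. The helper
- name copy_group is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10  /* same limit the manual page quotes */

/*
 * Copy the n-th matched substring of `text` into `out`.
 * Returns 1 on success; 0 if the pattern is bad, nothing matches,
 * group n took no part in the match, or `out` is too small.
 */
int copy_group(const char *pattern, const char *text, int n,
               char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[NSUBEXP];
    int ok = 0;

    assert(n >= 0 && n < NSUBEXP);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;
    if (regexec(&re, text, NSUBEXP, m, 0) == 0 && m[n].rm_so != -1) {
        /* rm_so is the offset of the first character of the substring,
           rm_eo the offset of the first character following it. */
        size_t len = (size_t)(m[n].rm_eo - m[n].rm_so);
        if (len < outsz) {
            memcpy(out, text + m[n].rm_so, len);
            out[len] = '\0';
            ok = 1;
        }
    }
    regfree(&re);
    return ok;
}
```

- With the pattern ([a-z]+)@([a-z]+) and the text "mail bob@example now",
- group 0 is the whole match and groups 1 and 2 are the two
- parenthesized parts, numbered by their opening parentheses.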
-
- Regsub copies source to dest, making substitutions according
- to the most recent regexec performed using prog. Each
- instance of `&' in source is replaced by the substring
- indicated by startp[0] and endp[0]. Each instance of `\n',
- where n is a digit, is replaced by the substring indicated
- by startp[n] and endp[n]. To get a literal `&' or `\n' into
- dest, prefix it with `\'; to get a literal `\' preceding `&'
- or `\n', prefix it with another `\'.
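-
- The substitution rules can be sketched in a few lines of C. This is
- an illustration of the rules as stated above, not the library's own
- regsub: the struct spans type and the expand function are
- hypothetical, and unlike regsub the sketch takes a destination size
- so that it cannot overrun dest.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10

/* Hypothetical stand-in for the startp/endp members of a regexp. */
struct spans {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
};

/*
 * Copy `source` to `dest` (capacity destsz, NUL included), replacing
 * `&' and `\n' with the recorded substrings.
 * Returns 1 on success, 0 if dest is too small.
 */
int expand(const struct spans *sp, const char *source,
           char *dest, size_t destsz)
{
    const char *src = source;
    size_t used = 0;

    assert(sp != NULL && source != NULL && dest != NULL);
    while (*src != '\0') {
        char c = *src++;
        int no = -1;

        if (c == '&')
            no = 0;                      /* whole match */
        else if (c == '\\' && *src >= '0' && *src <= '9')
            no = *src++ - '0';           /* numbered substring */

        if (no < 0) {
            /* Ordinary character; `\\' and `\&' drop the backslash. */
            if (c == '\\' && (*src == '\\' || *src == '&'))
                c = *src++;
            if (used + 1 >= destsz)
                return 0;
            dest[used++] = c;
        } else if (sp->startp[no] != NULL && sp->endp[no] != NULL) {
            size_t len = (size_t)(sp->endp[no] - sp->startp[no]);
            if (used + len >= destsz)
                return 0;
            memcpy(dest + used, sp->startp[no], len);
            used += len;
        }
    }
    if (used >= destsz)
        return 0;
    dest[used] = '\0';
    return 1;
}
```

- If substring 0 is "hello world", substring 1 is "hello", and
- substring 2 is "world", the source text \2 \1 [&] \& expands to
- "world hello [hello world] &".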
-
- Regerror is called whenever an error is detected in regcomp,
- regexec, or regsub. The default regerror writes the string
- msg, with a suitable indicator of origin, on the standard
- error output and invokes exit(2). Regerror can be replaced
- by the user if other actions are desirable.
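-
- A replacement regerror might record the message and return instead
- of exiting. The sketch below is one possible shape, assuming the
- program links its own regerror in place of the default; the buffer
- size and the last_regex_error accessor are hypothetical. Because
- this version returns, the caller is expected to check for a NULL
- result from regcomp, as DIAGNOSTICS describes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Most recent message; empty until an error has been reported. */
static char error_text[128] = "";

/* Replacement hook: remember the message and return to the caller. */
void regerror(char *msg)
{
    assert(msg != NULL);
    snprintf(error_text, sizeof error_text, "%s", msg);
}

/* Accessor so callers can report the failure in their own way. */
const char *last_regex_error(void)
{
    return error_text;
}
```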
-
- REGULAR EXPRESSION SYNTAX
- A regular expression is zero or more branches, separated by
- `|'. It matches anything that matches one of the branches.
-
- A branch is zero or more pieces, concatenated. It matches a
- match for the first, followed by a match for the second,
- etc.
-
- A piece is an atom possibly followed by `*', `+', or `?'.
- An atom followed by `*' matches a sequence of 0 or more
- matches of the atom. An atom followed by `+' matches a
- sequence of 1 or more matches of the atom. An atom followed
- by `?' matches a match of the atom, or the null string.
-
- An atom is a regular expression in parentheses (matching a
- match for the regular expression), a range (see below), `.'
- (matching any single character), `^' (matching the null
- string at the beginning of the input string), `$' (matching
- the null string at the end of the input string), a `\'
- followed by a single character (matching that character), or
- a single character with no other significance (matching that
- character).
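-
- The piece and atom rules can be exercised with a small test helper.
- The sketch below assumes the POSIX <regex.h> extended syntax, which
- agrees with the rules above for these constructs but is a different
- interface from the routines this page documents; the names matches
- and demo_pieces_and_atoms are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if `pattern` matches somewhere in `text`, else 0
   (also 0 when the pattern does not compile). */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    int ok;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

void demo_pieces_and_atoms(void)
{
    assert(matches("^ab*c$", "ac"));          /* `*': zero or more b */
    assert(matches("^ab*c$", "abbbc"));
    assert(!matches("^ab+c$", "ac"));         /* `+': at least one b */
    assert(matches("^ab+c$", "abc"));
    assert(matches("^ab?c$", "ac"));          /* `?': one b or nothing */
    assert(!matches("^ab?c$", "abbc"));
    assert(matches("^(ab|cd)+$", "abcdab"));  /* parenthesized atom */
    assert(matches("^a.c$", "axc"));          /* `.': any single character */
    assert(!matches("^a\\.c$", "axc"));       /* `\.': a literal dot */
    assert(matches("^a\\.c$", "a.c"));
}
```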
-
- A range is a sequence of characters enclosed in `[]'. It
- normally matches any single character from the sequence. If
- the sequence begins with `^', it matches any single
- character not from the rest of the sequence. If two
- characters in the sequence are separated by `-', this is
- shorthand for the full list of ASCII characters between them
- (e.g. `[0-9]' matches any decimal digit). To include a
- literal `]' in the sequence, make it the first character
- (following a possible `^'). To include a literal `-', make
- it the first or last character.
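-
- The placement rules for `]' and `-' are easy to get wrong, so a
- quick check helps. The sketch below again assumes POSIX <regex.h>
- as a stand-in, since its bracket expressions follow the same
- placement rules; the helper name in_range is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Return 1 if the single character `c` is matched by `range`
   (a bracketed range such as "[0-9]"), else 0. */
int in_range(const char *range, char c)
{
    char pattern[64];
    char subject[2];
    regex_t re;
    int n, ok;

    assert(range != NULL);
    if (c == '\0')
        return 0;
    /* Anchor the range so it must account for the whole subject. */
    n = snprintf(pattern, sizeof pattern, "^%s$", range);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return 0;
    subject[0] = c;
    subject[1] = '\0';
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&re, subject, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

- For instance, in_range("[]a]", ']') and in_range("[a-]", '-') both
- succeed, while in_range("[^]a]", ']') fails because the leading `^'
- negates the sequence.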
-
- AMBIGUITY
- If a regular expression could match two different parts of
- the input string, it will match the one which begins
- earliest. If both begin in the same place but match
- different lengths, or match the same length in different
- ways, life gets messier, as follows.
-
- In general, the possibilities in a list of branches are
- considered in left-to-right order, the possibilities for
- `*', `+', and `?' are considered longest-first, nested
- constructs are considered from the outermost in, and
- concatenated constructs are considered leftmost-first. The
- match that will be chosen is the one that uses the earliest
- possibility in the first choice that has to be made. If
- there is more than one choice, the next will be made in the
- same manner (earliest possibility) subject to the decision
- on the first choice. And so forth.
-
- For example, `(ab|a)b*c' could match `abc' in one of two
- ways. The first choice is between `ab' and `a'; since `ab'
- is earlier, and does lead to a successful overall match, it
- is chosen. Since the `b' is already spoken for, the `b*'
- must match its last possibility (the empty string), since it
- must respect the earlier choice.
-
- In the particular case where no `|'s are present and there
- is only one `*', `+', or `?', the net effect is that the
- longest possible match will be chosen. So `ab*', presented
- with `xabbbby', will match `abbbb'. Note that if `ab*' is
- tried against `xabyabbbz', it will match `ab' just after
- `x', due to the begins-earliest rule. (In effect, the
- decision on where to start the match is the first choice to
- be made, hence subsequent choices must respect it even if
- this leads them to less-preferred alternatives.)
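-
- The two `ab*' cases can be reproduced with the sketch below. It
- assumes POSIX <regex.h>, which resolves ambiguity by a
- leftmost-longest rule rather than the ordered choices described
- here; the two approaches agree on these simple patterns but may
- assign parenthesized substrings differently in harder cases. The
- helper name first_match is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Find the first match of `pattern` in `text` and copy it to `out`.
 * Returns the starting offset of the match, or -1 on no match,
 * a bad pattern, or a too-small buffer.
 */
long first_match(const char *pattern, const char *text,
                 char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[1];
    long start = -1;

    assert(pattern != NULL && text != NULL && out != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, m, 0) == 0) {
        size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (len < outsz) {
            memcpy(out, text + m[0].rm_so, len);
            out[len] = '\0';
            start = (long)m[0].rm_so;
        }
    }
    regfree(&re);
    return start;
}
```

- Against `xabbbby' the match is `abbbb' at offset 1; against
- `xabyabbbz' it is `ab' at offset 1, not the longer `abbb' that
- begins later.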
-
- In addition, \w matches an alphanumeric character (including "_")
- and \W a nonalphanumeric. Word boundaries may be matched by \b,
- and non-boundaries by \B. A whitespace character is matched by
- \s, non-whitespace by \S. A numeric character is matched by \d,
- non-numeric by \D. You may use \w, \s and \d within character
- classes.
-
- The class of characters recognized by \w (and hence not recognized
- by \W) can be augmented via the 'addAlphaChars' command.
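-
- The character classes behind these escapes can be approximated
- with <ctype.h>. This is a sketch under stated assumptions, not the
- code used in Alpha or Tcl: it assumes \s and \d correspond to
- isspace and isdigit, treats the ends of the string as non-word
- positions for \b, and models 'addAlphaChars' with a hypothetical
- `extra' argument listing the added characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* \w with extra word characters (hypothetical addAlphaChars model). */
int is_word_ext(char c, const char *extra)
{
    unsigned char u = (unsigned char)c;

    assert(extra != NULL);
    if (c == '\0')
        return 0;
    return isalnum(u) || c == '_' || strchr(extra, c) != NULL;
}

int is_word(char c)  { return is_word_ext(c, ""); }                /* \w */
int is_space(char c) { return isspace((unsigned char)c) != 0; }    /* \s */
int is_digit(char c) { return isdigit((unsigned char)c) != 0; }    /* \d */

/* \b: position i of s lies between a word character and a non-word
   character; the start and end of the string count as non-word. */
int at_word_boundary(const char *s, size_t i)
{
    size_t len = strlen(s);
    int before = (i > 0 && i <= len) ? is_word(s[i - 1]) : 0;
    int after = (i < len) ? is_word(s[i]) : 0;

    return before != after;
}
```

- In "foo bar", positions 0, 3, 4, and 7 are word boundaries, while
- position 1 (between `f' and `o') is not.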
-
- DIAGNOSTICS
- Regcomp returns NULL for a failure (regerror permitting),
- where failures are syntax errors, exceeding implementation
- limits, or applying `+' or `*' to a possibly-null operand.
-
- HISTORY
- Both code and manual page were written at U of T. They are
- intended to be compatible with the Bell V8 regexp(3), but
- are not derived from Bell code.
-
- BUGS
- Empty branches and empty regular expressions are not
- portable to V8.
-
- The restriction against applying `*' or `+' to a possibly-
- null operand is an artifact of the simplistic
- implementation.
-
- Does not support egrep's newline-separated branches; neither
- does the V8 regexp(3), though.
-
- Due to emphasis on compactness and simplicity, it's not
- strikingly fast. It does give special attention to handling
- simple cases quickly.
-